Abstract

Computational grids have the potential for solving large-scale scientific problems using heterogeneous
and geographically distributed resources. However, a number of major technical hurdles must be
overcome before this potential can be realized. One problem that is critical to effective utilization of 
computational grids is the efficient scheduling of jobs. This work addresses this problem by describing 
and evaluating a grid scheduling architecture and three job migration algorithms. The architecture is 
scalable and does not assume control of local site resources. The job migration policies use the avail- 
ability and performance of computer systems, the network bandwidth available between systems, and 
the volume of input and output data associated with each job. An extensive performance comparison is 
presented using real workloads from leading computational centers. The results, based on several key 
metrics, demonstrate that the performance of our distributed migration algorithms is significantly greater 
than that of a local scheduling framework and comparable to a non-scalable global scheduling approach. 


1 Introduction 

One of the primary goals of grid computing [1, 6] is to share access to geographically distributed
heterogeneous resources in a transparent manner. There will be many benefits when this goal is realized, including
the ability to execute applications whose computational requirements exceed local resources and the reduc- 
tion of job turnaround time through workload balancing across multiple computing facilities. The develop- 
ment of computational grids and the associated middleware has therefore been actively pursued in recent 
years. However, many major technical (and political) hurdles stand in the way of realizing these benefits. 
Among the myriad research issues to be addressed is the problem of distributed resource management and 
job scheduling for computational grids. Although numerous researchers have proposed scheduling algo- 
rithms for parallel architectures [3, 4, 5, 7, 9, 11], the problem of scheduling jobs in a heterogeneous grid 
environment is fundamentally different. This is the focus of our work in this paper. 

Our approach to this problem begins with defining a grid scheduling architecture that consists of au- 
tonomous local schedulers that schedule access to computer systems and grid schedulers, paired with local 
schedulers, that send jobs to local schedulers and migrate jobs between grid schedulers. It is important that 
grid scheduling be distributed for scalability and fault tolerance and it is important that local schedulers have 
control of local resources so that grid scheduling will be accepted by the owners of the computer systems. 
Our grid scheduling architecture is presented in Section 2. 

Second, we propose algorithms for migrating jobs between grid schedulers. These algorithms try to
migrate jobs when the wait times of compute servers rise above or fall below specific thresholds. These
migration algorithms decide whether to send or receive jobs using the requirements of each job (number of
CPUs, wallclock time, amount of input data, and amount of output data), the availability and performance
of computer systems, and the expected network bandwidth available between systems. Our job migration 
algorithms are presented in Section 3. 

Third, we evaluate our grid scheduling algorithms by simulating compute servers, networks, and sched- 
ulers and driving these simulations using workloads derived from trace data gathered from leading compu- 
tational centers. We gather several key performance metrics during these simulations and use these metrics 
to compare the performance of our algorithms and reference local and centralized scheduling algorithms. 
The methodology we use to gather performance data is presented in Section 4. Our experiments show that
one of our algorithms has slightly lower turn-around times than our others and that these times are 47% less 
than if no grid scheduling is performed. Further, we find that for our experiments with larger data sizes and 
lower network bandwidths, ignoring data transfers when making migration decisions can result in 690% 
higher turn-around times. The results of our simulations and an evaluation of these results are presented in 
Section 5. Finally, we present conclusions and future work in Section 6. 

2 Grid Scheduling Architecture 

We use a common grid scheduling architecture, shown in Figure 1, for the grid scheduling algorithms that we 
propose. The architecture is composed of distributed compute servers, local schedulers with local queues, 
and grid schedulers with grid queues. A local job is submitted to a local scheduler (LS) which places the 
job in its local queue (LQ). The local scheduler removes jobs from the local queue and executes them
on the local compute server. A grid job is submitted to a grid scheduler (GS) which places the job in its grid
queue (GQ). A grid scheduler gathers information from its local scheduler and its peer grid schedulers
and decides whether to send jobs to the local scheduler, send jobs to other grid schedulers, or request jobs 
from other grid schedulers. One issue which we do not address in this work is how grid schedulers locate 
their peer grid schedulers. We expect that traditional peer-to-peer (P2P) peer location approaches that use 
centralized or distributed indexes can be used and we plan to examine this issue in future work. 

There are a variety of grid scheduling architectures that we could have adopted. A centralized architec- 
ture with a single scheduler for multiple computer systems might be a good choice for a relatively small set 
of computer systems on a single machine room floor, but this approach won’t scale and is not fault tolerant 
in a geographically distributed environment. A hierarchy where grid schedulers are organized into a tree 
and jobs flow up and down the tree [8] is an interesting approach, but we do not expect it to scale as well as 
a P2P approach. A variation of our architecture is one in which the local scheduler and grid scheduler are 
combined into a single scheduler. This is starting to occur as scheduling vendors adopt a grid approach to 
scheduling [12, 10], but these systems don’t interoperate and are not yet widely used. Another approach to 
grid scheduling is where local scheduling is performed as usual but grid users use user-level grid schedulers 
to select which local schedulers to submit applications to [2]. This approach is very similar to our P2P ap- 
proach, the difference being that user-level grid schedulers are seeking to optimize the execution of jobs for a 
single user while our grid schedulers are seeking to optimize the execution of all jobs. We believe this subtle 
difference results in the P2P grid scheduling approach having greater potential scheduling performance. In 
the end, we chose a P2P architecture with a grid scheduler co-located with each local scheduler. We believe 
that this approach [13] gives us the best potential scalability, fault tolerance, and scheduling performance 
without requiring that sites replace their local schedulers. 

Figure 1: Our grid scheduling architecture. Solid arrows represent movement of jobs, dashed arrows
represent transfer of information.


3 Grid Scheduling Algorithms 

This section presents the three distributed scheduling algorithms that are the subject of this work and two 
reference algorithms. Our distributed scheduling algorithms are the sender-initiated, receiver-initiated, and
symmetrically-initiated algorithms. These algorithms operate in a P2P manner and use different strategies 
for migrating jobs between grid schedulers. Our two reference algorithms that we use for comparison are 
a centralized algorithm that uses a single grid scheduler that interacts with all local schedulers and a local 
algorithm that has no grid schedulers and executes all jobs on the compute server where they were submitted. 

3.1 Distributed Algorithms 

Our three distributed algorithms are based around common steps: 

1. A job j is submitted to a grid scheduler on compute server s and is placed in the associated grid queue. 

2. The grid scheduler asks the local scheduler on s for the approximate wait time (AWT) of the job. The 
approximate wait time is the amount of time the local scheduler estimates job j, if submitted to it,
will wait in the local queue before it begins executing. The AWT is computed by simulating the local 
scheduling algorithm using the local jobs that are either running or waiting in the local queue and 
the job j. If the local scheduler cannot satisfy the resource requirements of j, an AWT of infinity is 
returned. 

3. The grid scheduler tests the approximate wait time for j against a threshold $\phi$. If the AWT is less than
$\phi$, j is sent directly to the local scheduler for execution on s. If the AWT is at least $\phi$, the job is kept
in the grid queue and one of our job migration algorithms is invoked.
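
The three common steps can be illustrated with a minimal Python sketch. It is not the simulator used in this work: the function name `dispatch`, the callable `estimate_awt` (a stand-in for the AWT query to the local scheduler), the returned action labels, and the numeric values are all hypothetical.

```python
import math
from typing import Callable


def dispatch(job_id: str, estimate_awt: Callable[[str], float], phi: float) -> str:
    """Return the action a grid scheduler takes for a newly queued job.

    estimate_awt is a hypothetical stand-in for the AWT query to the local
    scheduler; it should return math.inf when the local server cannot
    satisfy the job's resource requirements.
    """
    awt = estimate_awt(job_id)
    if awt < phi:
        return "send_to_local_scheduler"   # AWT below the threshold
    return "invoke_migration"              # AWT at least the threshold


if __name__ == "__main__":
    # Illustrative AWT values in seconds; not taken from the paper.
    awts = {"short": 120.0, "long": 7200.0, "unsatisfiable": math.inf}
    for job_id in awts:
        print(job_id, dispatch(job_id, awts.__getitem__, phi=600.0))
```

Because an unsatisfiable job has an AWT of infinity, it always fails the threshold test and is handed to the migration algorithm rather than to the local scheduler.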

3.1.1 Sender-Initiated 

In the sender-initiated (S-I) strategy, the grid scheduler sends the resource requirements of the job to its
peers. In this study, we only consider the CPU and run time requirements of each job; however, this can be 
extended to an arbitrary number of resource constraints. In response to the query, each peer grid scheduler 
returns the approximate turnaround time (ATT) for the job and the resource utilization (RU) of the compute
server associated with that grid scheduler. ATT is an estimate of the amount of time it will take to complete a job.
The ATT for a job j on compute server $s_{exec}$ that is initially submitted to a grid scheduler on compute
server $s_{init}$ is derived in the following way:

$$\mathrm{ATT}(j, s_{exec}) = \max\bigl(\mathrm{AWT}(j, s_{exec}),\ \mathrm{ADT}(j_{in}, s_{init}, s_{exec})\bigr) + \mathrm{ERT}(j, s_{exec}) + \mathrm{ADT}(j_{out}, s_{exec}, s_{init})$$

Before a job begins to execute, it needs to both wait in a local queue and transfer input data to the system
where it will execute. $\mathrm{AWT}(j, s_{exec})$ is the approximate wait time of job j on $s_{exec}$ and
$\mathrm{ADT}(j_{in}, s_{init}, s_{exec})$ is the approximate data transfer time (ADT) of the input data of j from
$s_{init}$ to $s_{exec}$. We assume that these activities can be performed simultaneously, so the maximum of the
two constrains when the job can begin executing. The job then executes on $s_{exec}$ with an expected run time of
$\mathrm{ERT}(j, s_{exec})$ and the output data is transferred from where it executed to the compute server where
it was initially submitted in time $\mathrm{ADT}(j_{out}, s_{exec}, s_{init})$. Note that the expected run time can
vary from one compute server to another depending on their architectural designs and program
characterizations. We simplify the calculation of ERT
by assuming that run time is only related to the clock frequency of the compute server. 
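
As a worked illustration of the ATT calculation, the following Python sketch evaluates the estimate for one job. The function names and all numeric values are hypothetical, and the run-time scaling assumes that run time is inversely proportional to clock frequency, which is one reading of the simplification stated above.

```python
def expected_run_time(run_time_ref: float, mhz_ref: float, mhz_exec: float) -> float:
    """Scale a reference run time to another server.

    Assumes run time is inversely proportional to clock frequency, which is
    one reading of the simplification stated in the text.
    """
    return run_time_ref * mhz_ref / mhz_exec


def approximate_turnaround_time(awt: float, adt_in: float, ert: float, adt_out: float) -> float:
    """ATT = max(AWT, ADT_in) + ERT + ADT_out, all in seconds."""
    # Queue wait and input transfer overlap, so only the longer one delays the start.
    return max(awt, adt_in) + ert + adt_out


if __name__ == "__main__":
    # Illustrative numbers: a 3600 s job recorded on a 375 MHz server,
    # evaluated for execution on a 600 MHz server.
    ert = expected_run_time(3600.0, 375.0, 600.0)
    print(ert)                                                     # 2250.0
    print(approximate_turnaround_time(300.0, 500.0, ert, 2500.0))  # 5250.0
```

In this example the input transfer (500 s) takes longer than the queue wait (300 s), so the transfer, not the queue, determines when the job can start.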

Resource utilization is the fraction of the compute server that is currently being utilized. We assume
our compute servers have multiple CPUs that are space shared so we calculate RU as the number of CPUs 
assigned to jobs divided by the total number of CPUs. If certain peer grid schedulers do not respond within 
a specified time limit due to traffic congestion or machine failure, they are simply ignored for that request. 

Based on the collected information, the grid scheduler calculates the potential turnaround cost (TC)
of itself and each partner. To compute the optimal TC, first the minimum approximate turnaround time is
found. If the minimum ATT is within a small tolerance $\epsilon$ for multiple machines, the system with the lowest
resource utilization is chosen to accept the job. Thus the TC metric attempts to minimize the user’s
time-to-solution, while using system utilization as a tiebreaker. We found this approach to be more effective than
simply relying on ATT. The job is then sent to the local scheduler (by way of its partner grid scheduler)
on the compute server with the minimal turnaround cost. Note that once a job enters a local queue, it will
be scheduled and run based exclusively on the policy of the local scheduler, and can no longer be migrated to
another site.
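
The selection rule described above (minimum ATT, with resource utilization breaking near-ties) can be sketched in Python as follows. The `Offer` type, the function names, and the sample values are hypothetical, and "within a small tolerance" is interpreted here as an absolute difference in seconds from the minimum ATT.

```python
from typing import NamedTuple, Sequence


class Offer(NamedTuple):
    server: str
    att: float  # approximate turnaround time in seconds
    ru: float   # resource utilization, busy CPUs / total CPUs


def resource_utilization(busy_cpus: int, total_cpus: int) -> float:
    """RU for space-shared CPUs: fraction of CPUs assigned to jobs."""
    return busy_cpus / total_cpus


def choose_destination(offers: Sequence[Offer], epsilon: float) -> str:
    """Pick the server with minimal ATT, using RU as the tiebreaker."""
    best_att = min(o.att for o in offers)
    # Servers whose ATT is within epsilon of the minimum are treated as tied.
    tied = [o for o in offers if o.att - best_att <= epsilon]
    return min(tied, key=lambda o: o.ru).server


if __name__ == "__main__":
    offers = [
        Offer("s1", 1000.0, resource_utilization(90, 100)),
        Offer("s2", 1005.0, resource_utilization(40, 100)),
        Offer("s3", 2000.0, resource_utilization(10, 100)),
    ]
    print(choose_destination(offers, epsilon=10.0))  # s2: tied on ATT, lower RU
    print(choose_destination(offers, epsilon=0.0))   # s1: strictly lowest ATT
```

Peers that do not respond within the time limit would simply be absent from `offers`.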

3.1.2 Receiver-Initiated 

The receiver-initiated (R-I) algorithm takes a more passive approach to job migration than the S-I strategy. 
Here, each system in the computational grid checks its own resource usage periodically at time interval $\sigma$.
If the RU is below a certain threshold $\delta$, the machine volunteers itself for receiving jobs by informing its
partner set of its low utilization. Once a peer grid scheduler (say, $GS_P$) receives this information, it checks
its grid queue for the first job waiting to be scheduled. If a job is indeed queued, its resource requirements
are sent to the volunteer node. The underutilized system then responds with the job’s ATT, as well as its own
RU. Based on this data, $GS_P$ computes and compares the turnaround cost between itself and the volunteer
system. If the TC of the volunteer is lower than that of $GS_P$, the job is transferred to the LQ of that system
through the GM. Otherwise, it continues to wait in the GQ until either its local AWT falls below $\phi$ (examined
at time interval $\sigma$), or an available machine volunteers its services.
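
A minimal Python sketch of the two sides of the receiver-initiated exchange is given below. The function names are hypothetical, and the turnaround cost is simplified to a single comparable number supplied by the callables `local_tc` and `volunteer_tc`; the message passing and the periodic timer are omitted.

```python
from collections import deque
from typing import Callable, Deque, Optional


def should_volunteer(busy_cpus: int, total_cpus: int, delta: float) -> bool:
    """Volunteer side: checked periodically; True when RU is below the threshold."""
    return busy_cpus / total_cpus < delta


def on_volunteer_message(
    grid_queue: Deque[str],
    local_tc: Callable[[str], float],
    volunteer_tc: Callable[[str], float],
) -> Optional[str]:
    """Peer side: return the job to transfer to the volunteer, or None."""
    if not grid_queue:
        return None                      # nothing is waiting in the grid queue
    job_id = grid_queue[0]               # first job waiting to be scheduled
    if volunteer_tc(job_id) < local_tc(job_id):
        return grid_queue.popleft()      # migrate the job to the volunteer
    return None                          # job keeps waiting in the grid queue


if __name__ == "__main__":
    queue = deque(["j1", "j2"])
    print(should_volunteer(busy_cpus=10, total_cpus=100, delta=0.2))
    print(on_volunteer_message(queue, lambda j: 100.0, lambda j: 50.0))
    print(list(queue))
```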

3.1.3 Symmetrically-Initiated 

Unlike S-I and R-I, the symmetrically-initiated (Sy-I) algorithm works in both active and passive modes. 
As in the R-I strategy, each machine periodically checks its own resource usage and broadcasts a message 
to its partner set if it is underutilized. The difference occurs when the local approximate wait time of a 
job exceeds $\phi$ but no underutilized machine volunteers its services. In the R-I approach, the job passively
sits in the GQ while waiting for a volunteer, and periodically checks its local AWT at each time interval $\sigma$.
However, the Sy-I algorithm immediately switches to active mode and sends a request to its partners using 
the S-I strategy. The main differences in the three job migration algorithms therefore lie in the timing of the 
job transfer request initiations and the destination choice for those requests. 
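
The difference in who initiates a transfer can be summarized in a short Python sketch. The enum, the function names, and the action labels are hypothetical; the sketch only encodes the behavior described in this section for two events, a job whose local AWT is at or above the threshold and a server whose utilization is below its threshold.

```python
from enum import Enum


class Algorithm(Enum):
    SENDER = "S-I"
    RECEIVER = "R-I"
    SYMMETRIC = "Sy-I"


def on_job_over_threshold(algorithm: Algorithm) -> str:
    """Action when a queued job's local AWT is too high and no peer has volunteered."""
    if algorithm is Algorithm.RECEIVER:
        return "wait_for_volunteer"   # passive: the job stays in the grid queue
    return "query_peers"              # active: S-I always, Sy-I as a fallback


def on_low_utilization(algorithm: Algorithm) -> str:
    """Action when a server finds its own utilization below the threshold."""
    if algorithm is Algorithm.SENDER:
        return "no_action"            # S-I servers never advertise themselves
    return "announce_to_peers"        # R-I and Sy-I servers volunteer


if __name__ == "__main__":
    for algorithm in Algorithm:
        print(algorithm.value, on_job_over_threshold(algorithm), on_low_utilization(algorithm))
```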

3.2 Reference Algorithms 

We use two scheduling algorithms as references against which to compare our work. The centralized
algorithm has a single grid scheduler and represents a performance target for our distributed scheduling
approaches. The local algorithm performs no job migration and represents the current non-grid scheduling
environment.

3.2.1 Centralized 

In the centralized scheduling algorithm, all jobs are submitted to a single grid scheduler which does not 
have an affinity to a specific local system. The GS is responsible for making global decisions and assigning 
each job to a specific machine. The GS tracks the status of each job and maintains up-to-date information 
on all available resources, allowing it to compute the turnaround cost directly, without the need for any 
communication. When a job arrives, the GS computes its TC for all systems, selects the one with the 
minimum TC, and immediately migrates the job to that system. Although communication-free resource 
awareness is an unrealistic assumption, it allows us to model the potential gain of a centralized architecture. 
However, it constitutes a single point of failure and thus suffers from a lack of reliability and fault tolerance. 
Additionally, this approach has severe scalability problems that may result in a performance bottleneck for 
large-scale grid environments. 

3.2.2 Local 

In the local scheduling algorithm, there are no grid schedulers. All jobs are submitted to local schedulers 
and execute on the compute server associated with each local scheduler. This approach represents how 
scheduling is currently being performed and we use it as a way to demonstrate the benefits of grid scheduling
algorithms. 

4 Methodology 

We evaluate our grid scheduling algorithms using simulations of resources and jobs. We simulate the
submission of workloads of jobs to grid schedulers, the operation of grid and local schedulers, the transfer 
of job input and output data between compute servers, and the execution of jobs on compute servers. During 
these simulations, we gather performance information so that we can compare the various grid scheduling 
algorithms. 

4.1 Resource Configurations 

We simulate 7 different compute servers in our simulations. These systems have the same characteristics
as systems located at the National Energy Research Scientific Computing Center (NERSC) at Lawrence
Berkeley National Laboratory, the NASA Advanced Supercomputing Division at NASA Ames Research Center,
the Lawrence Livermore National Laboratory, and the San Diego Supercomputer Center. These systems
are all parallel computers, some of which consist of cache-coherent Symmetric Multiprocessor (SMP) nodes
interconnected by a fast proprietary network, and others of which are Non-Uniform Memory Access (NUMA)
shared memory systems also connected by a fast proprietary network. Both types of systems partition CPUs
into nodes for management purposes and the current practice is to allocate each node to a single
application so that applications do not interfere with each other. We therefore used this allocation approach in our
simulation environment.

Server       Number     CPUs       CPU Speed   Site Identifier
Identifier   of Nodes   per Node   (MHz)       (3 site)   (6 site)   (12 site)
S1           184        16         375         0          0          0
S2           305        4          332         1          1          1
S3           144        8          375         2          3          2
S4           1024       4          600         1          0          3
S5           64         2          250         2          2          4
S6           512        4          400         2          5          5
S7           128        2          250         2          5          6
S8           144        8          375         1          2          7
S9           1024       4          600         0          4          8
S10          64         2          250         0          1          9
S11          512        4          400         0          3          10
S12          128        2          250         1          4          11

Table 1: Configurations of the computational servers and assignment to sites when there are 3 sites, 6 sites,
or 12 sites.

We wanted to use 12 compute servers to give us more options for splitting servers into sets, so we
duplicated 5 of the 7 compute servers to produce a total of 12. We then split the systems into 3, 6, and 12 sets to
simulate compute servers grouped into 3, 6, or 12 machine rooms at different sites. Each set has an equal 
number of machines and we attempted to make the computational power in each set as equal as we could. 
The characteristics of these systems and the sites to which they are assigned are shown in Table 1. 

We also simulate the networks connecting the compute servers. We assume that all of the compute 
servers at a single site share a network and that each of these networks is connected to every other site 
network using a point-to-point network connection. When we simulate the transfer of data for a job, we 
simulate the use of a site network on the sending side, a point-to-point network, and a site network on 
the receiving side. Any of these three networks can constrain the end-to-end data transfer bandwidth. We 
assume that all data transfers using a network share the network bandwidth equally. We perform simulations 
using two different assumptions about available network bandwidth. First, we assume that 800 Mb/s is 
available from each site network and 40 Mb/s is available from each point-to-point network. This represents 
a gigabit Ethernet site network and a relatively high performance Wide Area Network (WAN). Second, we
assume that 80 Mb/s is available from each site network and 4 Mb/s is available from each point-to-point
network. This represents a 100 megabit Ethernet site network and a somewhat slower WAN.
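
The effect of the three-network path and equal bandwidth sharing can be illustrated with a short Python sketch. The function names and the transfer counts are hypothetical. The sketch also assumes that 1 Mb/s means 10^6 bits per second and that the number of concurrent transfers stays fixed for the duration of a transfer, whereas in a simulation it would change over time.

```python
def end_to_end_mbps(site_mbps: float, wan_mbps: float,
                    n_src: int, n_wan: int, n_dst: int) -> float:
    """Bandwidth seen by one transfer crossing source site, WAN link, and destination site.

    n_src, n_wan, and n_dst count the transfers (including this one) that
    currently share each network; each network is divided equally among them.
    """
    return min(site_mbps / n_src, wan_mbps / n_wan, site_mbps / n_dst)


def transfer_seconds(num_bytes: int, mbps: float) -> float:
    """Time to move num_bytes at the given rate, assuming 1 Mb/s = 10**6 bits/s."""
    return num_bytes * 8 / (mbps * 1_000_000)


if __name__ == "__main__":
    # Faster configuration: 800 Mb/s site networks, 40 Mb/s point-to-point links.
    fast = end_to_end_mbps(800.0, 40.0, n_src=4, n_wan=2, n_dst=1)
    # Slower configuration: 80 Mb/s site networks, 4 Mb/s point-to-point links.
    slow = end_to_end_mbps(80.0, 4.0, n_src=4, n_wan=2, n_dst=1)
    print(fast, transfer_seconds(1_000_000_000, fast))  # 20.0 400.0
    print(slow, transfer_seconds(1_000_000_000, slow))  # 2.0 4000.0
```

In both configurations of this example the point-to-point link is the bottleneck, so a tenfold drop in bandwidth produces a tenfold increase in transfer time.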

For the experiments in this paper, we make two simplifying assumptions. First, we assume that program 
performance is linearly related to CPU speed. Second, even though the systems we are simulating are not 
all binary compatible, we assume that users have compiled their applications for each of the heterogeneous 
platforms. We plan to relax both of these assumptions in future work.

Table 2: Characteristics of the workloads used in our performance comparison. Workload $W_i$ is submitted
to the grid scheduler on system $S_i$ during our simulations.


4.2 Workloads 

We base our workloads on trace data obtained from schedulers on 7 compute servers. Seven traces, one 
from each system, were recorded from March of 2002 through May of 2002. Five traces were gathered 
from 5 of the same 7 systems but recorded from September of 2002 through November of 2002. These 12
traces do not include information on how much input data is used by each job and how much output data is 
produced by each job because this data is not typically available to local schedulers. So, we added synthetic 
information about input and output data sizes to each job in the workloads. 

When adding input and output data sizes to the jobs, we assume that the amount of this data is correlated 
to the amount of work (number of CPUs multiplied by amount of wallclock run time) performed by each job. 
We also add a random element to calculating this data, so we set the amount of input data for a job j using a
Gaussian distribution with a mean $\mu_j = b \cdot \mathit{cpus}_j \cdot \mathit{walltimeseconds}_j$ and a standard deviation $\sigma_j$,
where b is the number of bytes for each unit of work the job performs. Using anecdotal observations, our
best estimate for b is 1,000 bytes for each CPU second the application executes. For comparison, we also
create workloads assuming that b is 100 and 10,000. We refer to the workloads with b of 100, 1,000, and 10,000,
creatively, as small data, medium data, and large data, respectively. In all cases, we assume that the output data size is 5 times as large as the input
data size calculated using one of the previous methods. 
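
The generation of synthetic data sizes can be sketched in Python as follows. The function name is hypothetical. Two details are assumptions introduced only for illustration: the standard deviation is expressed as a fraction `rel_sigma` of the mean, and negative draws are clamped to zero.

```python
import random
from typing import Tuple


def synthetic_data_sizes(cpus: int, walltime_seconds: float, b: float,
                         rel_sigma: float, rng: random.Random) -> Tuple[float, float]:
    """Return (input_bytes, output_bytes) for one job.

    rel_sigma (standard deviation as a fraction of the mean) and the clamp
    at zero are assumptions of this sketch, not values from the paper.
    """
    mean = b * cpus * walltime_seconds          # bytes proportional to work done
    input_bytes = max(0.0, rng.gauss(mean, rel_sigma * mean))
    output_bytes = 5 * input_bytes              # output is 5 times the input size
    return input_bytes, output_bytes


if __name__ == "__main__":
    rng = random.Random(42)
    # b = 100, 1,000, and 10,000 correspond to small, medium, and large data.
    for label, b in (("small", 100.0), ("medium", 1_000.0), ("large", 10_000.0)):
        print(label, synthetic_data_sizes(16, 3600.0, b, rel_sigma=0.25, rng=rng))
```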

The characteristics of our workloads are shown in Table 2. 

4.3 Performance Metrics 

We use several key metrics in our simulations to evaluate the effectiveness of our proposed grid scheduling 
architecture and job migration algorithms. These metrics are also used to compare performance with local 
and centralized job scheduling schemes. 

Since individual users and system administrators often have different (and possibly conflicting)
demands, no single measure can comprehensively capture overall grid performance. From the users’
perspective, key measures of grid performance include the Average Response Time and the Average Wait Time.
These are computed as follows (N is the total number of jobs):

$$\text{Average Response Time} = \frac{1}{N} \sum_{j \in Jobs} (EndTime_j - SubmitTime_j)$$

$$\text{Average Wait Time} = \frac{1}{N} \sum_{j \in Jobs} (StartTime_j - SubmitTime_j)$$

where $SubmitTime_j$, $StartTime_j$, and $EndTime_j$ are the times when job j is submitted to the queue, when it
commences execution, and when it is completed. The response (or turnaround) time is probably the single 
most important measure for an individual submitting a job; however, the wait time is also critical to users 
even though it is usually beyond their control. 
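
As an illustration, the two user-oriented metrics can be computed from per-job timestamps as in the Python sketch below. It assumes that response time is measured from submission to completion and wait time from submission to the start of execution, as the definitions of the three timestamps suggest. The `JobRecord` type and the sample values are hypothetical.

```python
from typing import NamedTuple, Sequence


class JobRecord(NamedTuple):
    submit: float  # time the job was submitted to the queue
    start: float   # time the job began executing
    end: float     # time the job completed


def average_response_time(jobs: Sequence[JobRecord]) -> float:
    """Mean of (end - submit) over all jobs."""
    return sum(j.end - j.submit for j in jobs) / len(jobs)


def average_wait_time(jobs: Sequence[JobRecord]) -> float:
    """Mean of (start - submit) over all jobs."""
    return sum(j.start - j.submit for j in jobs) / len(jobs)


if __name__ == "__main__":
    jobs = [JobRecord(0.0, 10.0, 110.0), JobRecord(50.0, 50.0, 80.0)]
    print(average_response_time(jobs))  # 70.0
    print(average_wait_time(jobs))      # 5.0
```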

A system administrator (or funding agency), on the other hand, is more interested in maximizing the uti- 
lization of the available computational resources at his/her center. Thus, we present the Weighted Utilization 
metric, which measures the overall ratio between consumed and available computational resources across a 
grid. It is computed as: 


$$\text{Weighted Utilization} = \frac{\sum_{j \in Jobs} (EndTime_j - StartTime_j) \times CPUs_j \times CPUSpeed_j}{(EndTime_{last\_job} - SubmitTime_{first\_job}) \times \sum_{m \in Servers} CPUs_m \times CPUSpeed_m} \times 100\%$$


where $(EndTime_{last\_job} - SubmitTime_{first\_job})$ is the duration of the entire simulation; $CPUs_j$ and $CPUSpeed_j$
are the number of processors used by job j and their clock speed; and $CPUs_m$ and $CPUSpeed_m$ are the
number of processors in machine m and their clock speed. Individual site-specific system utilizations are also
reported to understand the effects of superscheduling on local computational centers. 
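
A small Python sketch of the Weighted Utilization calculation follows. The record types and the sample values are hypothetical; the sketch assumes that a job's CPU speed is the clock speed of the server on which it ran.

```python
from typing import NamedTuple, Sequence


class JobRun(NamedTuple):
    start: float    # seconds
    end: float      # seconds
    cpus: int       # processors used by the job
    cpu_mhz: float  # clock speed of those processors


class Server(NamedTuple):
    cpus: int       # processors in the machine
    cpu_mhz: float  # clock speed of those processors


def weighted_utilization(jobs: Sequence[JobRun], servers: Sequence[Server],
                         first_submit: float, last_end: float) -> float:
    """Consumed CPU-MHz-seconds as a percentage of what the grid could supply."""
    consumed = sum((j.end - j.start) * j.cpus * j.cpu_mhz for j in jobs)
    capacity = (last_end - first_submit) * sum(s.cpus * s.cpu_mhz for s in servers)
    return 100.0 * consumed / capacity


if __name__ == "__main__":
    servers = [Server(4, 400.0), Server(2, 250.0)]
    jobs = [JobRun(0.0, 50.0, 4, 400.0), JobRun(0.0, 100.0, 1, 250.0)]
    print(weighted_utilization(jobs, servers, first_submit=0.0, last_end=100.0))  # 50.0
```

Weighting by clock speed means that an idle fast processor lowers the metric more than an idle slow one.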

The metric Fraction of Jobs Transferred allows us to determine if there is any relationship between the 
number of jobs transferred and the performance of the scheduling algorithms. This metric is defined as: 


$$\text{Fraction of Jobs Transferred} = \frac{\text{Number of Jobs Transferred}}{\text{Total Number of Jobs}}$$


Finally, we use the total volume of data transferred to help determine if the amount of data transferred 
by a scheduling algorithm is affecting its performance. This metric is defined as:


$$\text{Data Volume} = \sum_{j \in Jobs} (InputDataSize_j + OutputDataSize_j)$$
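
The two transfer-related metrics can be computed as in the Python sketch below. The `JobOutcome` type and the sample values are hypothetical. The sketch assumes that only jobs migrated away from their submission server move data over the network, so jobs that run where they were submitted contribute nothing to the data volume.

```python
from typing import NamedTuple, Sequence


class JobOutcome(NamedTuple):
    transferred: bool  # True if the job ran on a server other than where it was submitted
    input_bytes: int
    output_bytes: int


def fraction_of_jobs_transferred(jobs: Sequence[JobOutcome]) -> float:
    """Number of jobs transferred divided by the total number of jobs."""
    return sum(1 for j in jobs if j.transferred) / len(jobs)


def data_volume(jobs: Sequence[JobOutcome]) -> int:
    """Total input plus output bytes moved, counting migrated jobs only (an assumption)."""
    return sum(j.input_bytes + j.output_bytes for j in jobs if j.transferred)


if __name__ == "__main__":
    jobs = [
        JobOutcome(True, 100, 500),
        JobOutcome(False, 200, 1000),
        JobOutcome(True, 50, 250),
        JobOutcome(False, 10, 50),
    ]
    print(fraction_of_jobs_transferred(jobs))  # 0.5
    print(data_volume(jobs))                   # 900
```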

Note that performance, measured by any metric, is highly dependent on the workload requirements. For 
example, we would not expect an underloaded system to derive much benefit from a superscheduler in terms 
of grid efficiency, as there may not be much room for improvement. 


5 Results 

This section presents and analyzes the simulation results of our job migration algorithms using the perfor- 
mance metrics described in Section 4.3. 

To begin, we compare the performance of our sender-initiated, receiver-initiated, and symmetrically- 
initiated job migration algorithms. The performance data for these algorithms is shown in Figure 2. One 
of the most important metrics is the average response time because that is ultimately what users care about. 
We find that the S-I algorithm has the lowest average response time and this response time is 5.5% less 
than the response times resulting from the other two algorithms. This response time is 47% better than the 
response time if only local scheduling is performed and is only 0.4% worse than the response time of the
centralized algorithm. We also find that the average wait times correlate with the response times: S-I has
the lowest average wait time (62% less than the wait times of R-I and Sy-I) and this wait time is 34% less
than the average wait time of the centralized algorithm. These differences are much larger than
the response time differences, but they have little effect on overall response time because, on average, the
wait time is only 6% of the response time.

Figure 2: Comparison of the performance of our migration techniques.

Figure 2 also shows the average amount of data moved for each job. We find that the average data 
volume does not correlate with the response times: The local algorithm moves the least data and has the 
worst response times. The receiver-initiated and symmetrically-initiated algorithms move the next least 
amount of data and have the next highest response times. We do find that the centralized algorithm moves 
less data than the sender-initiated algorithm and also has lower response times.

A final observation is that the weighted average utilization is identical (53%) for our algorithms. This 
is because the jobs are submitted over time, rather than all at once, and the resource utilization obtained by 
even the best scheduling algorithm is limited by the amount of work that is submitted to it.

The data presented in Figure 2 allows us to begin examining the effect of decreasing network bandwidth 
by a factor of 10. We find that this does not have a significant impact on response time or wait time, but it
does result in a 46% reduction in the amount of data transferred over the network and in 88% more
jobs being executed on servers in the same site to which they are submitted. This shows that our migration
algorithms are adapting, and adapting well, to the decrease in network bandwidth. If we perform a more
detailed examination and examine the performance for our small, medium, and large data workloads, we
do see the effects of decreasing network bandwidth. This information for the sender-initiated algorithm is
shown in Figure 3. The data shows that the large data workloads have a significant 15% increase in response
time when network bandwidth is lowered, while the medium data workloads have only a 2% increase and
there is virtually no increase in response time for the small data workloads.

We can also use Figure 3 to examine the effect of increasing the amount of data transferred per job.
We find that it does have a small effect when we simulate the higher network bandwidths, but the effect is
more significant for lower network bandwidths. We find that reducing data sizes from medium to small
results in a reduction in response time by 2% while increasing data sizes from medium to large results in an
increase in response time by 15%.

Figure 3: Performance of our sender-initiated algorithm when varying the amount of data per job.

We next examine the effect of the number of sites the compute servers are grouped into (data not shown).
Contrary to our intuition, we find that the number of sites the servers are grouped into is not significant.
When we examine the performance of the sender-initiated algorithm, we find that even with our large data 
workloads and lowest network bandwidth, having 6 sites actually reduces the average response time by 0.2% 
over having 3 sites and having 12 sites only increases response time over 3 sites by 0.2%. 

Finally, we find that it is important to consider the size and placement of input and output data as well as 
the available network bandwidth when making migration decisions. We performed simulations of versions 
of our migration algorithms that do not consider the transfer of input and output data when making decisions. 
We have not completed these simulations, but for the symmetrically-initiated algorithm and workloads with 
large data sizes, ignoring data transfers when making decisions results in only a 4% increase in transfer times 
when using faster networks but results in a 690% increase in response times when using slower networks. 

6 Conclusions and Future Work 

One of the primary goals of grid computing is to share access to geographically distributed heterogeneous re- 
sources in a transparent manner. There will be many benefits when this goal is realized, including the ability 
to execute applications whose computational requirements exceed local resources and the reduction of job 
turnaround time through workload balancing across multiple computing facilities. We address this problem 
by defining a grid scheduling architecture and job migration algorithms and evaluating their performance. 
Our grid scheduling architecture is a peer-to-peer architecture which, we believe, is the architecture that 
will provide the best scalability and fault tolerance. Further, our architecture leaves local schedulers in place
and therefore does not take over control of local resources. 

We propose three job migration algorithms: the sender-initiated algorithm sends jobs from overloaded
compute servers to less loaded servers, the receiver-initiated algorithm has underloaded compute servers
volunteer to receive jobs, and the symmetrically-initiated algorithm uses a combination of both approaches. Our
experiments show that our sender-initiated algorithm has over 5% lower turn-around times than our others and
that these times are 47% less than if no grid scheduling is performed. Further, we find that for our
experiments with larger data sizes and lower network bandwidths, ignoring data transfers when making migration
decisions can result in 690% higher turn-around times.

There are several areas of future work that we plan to explore. We wish to study how our grid scheduling 
scales to a large number of grid schedulers, including addressing problems such as how grid schedulers find 
peers. We plan to relax assumptions such as performance being related only to CPU speed and that every
application can execute on every system. We also wish to compare our grid scheduling approach to others,
such as hierarchical architectures and architectures in which grid and local schedulers are combined.